SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control

Wang, Yuxuan, Jiang, Haobin, Yao, Shiqing, Ding, Ziluo, Lu, Zongqing

arXiv.org Artificial Intelligence

Existing humanoid control systems often rely on teleoperation or on modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation with a pretrained whole-body controller and pairing them with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. It generates action chunks using flow matching, which can subsequently be refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and it also supports multi-modal extensions by converting other input modalities into text.
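The flow-matching action head described in the abstract can be sketched roughly as follows. Everything here is a toy stand-in, not SENTINEL's actual implementation: the chunk length, action dimension, conditioning embedding, and the `velocity_field` function (which in the real system would be a learned network conditioned on language and proprioception) are all hypothetical.

```python
import numpy as np

CHUNK_LEN, ACT_DIM = 8, 23  # hypothetical chunk length and humanoid action dimension

def velocity_field(x, t, cond):
    # Stand-in for a learned velocity network v_theta(x, t | language, proprioception):
    # here a toy field that pulls samples toward a conditioning-dependent target.
    target = np.tanh(cond)            # pretend "decoded" action chunk
    return target - x                 # straight-line (rectified-flow-style) velocity

def sample_action_chunk(cond, steps=10, rng=None):
    """Integrate the flow from Gaussian noise to an action chunk (Euler method)."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal((CHUNK_LEN, ACT_DIM))
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, cond)
    return x

cond = np.zeros((CHUNK_LEN, ACT_DIM))   # placeholder fused language+proprioception embedding
chunk = sample_action_chunk(cond)
print(chunk.shape)  # (8, 23)
```

The paper's residual action head would then take such a chunk and apply a small learned correction before real-world execution; that refinement step is omitted here.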


GELATO: Multi-Instruction Trajectory Reshaping via Geometry-Aware Multiagent-based Orchestration

Huang, Junhui, Gong, Yuhe, Li, Changsheng, Duan, Xingguang, Figueredo, Luis

arXiv.org Artificial Intelligence

We present GELATO -- the first language-driven trajectory-reshaping framework to embed geometric environment awareness and multi-agent feedback orchestration, supporting multi-instruction human-robot interaction scenarios. Unlike prior learning-based methods, our approach automatically registers scene objects as 6D geometric primitives via a VLM-assisted multi-view pipeline, and an LLM translates free-form, multi-part instructions into explicit, verifiable geometric constraints. These constraints are integrated into a geometry-aware vector-field optimization that adapts initial trajectories while preserving smoothness, feasibility, and clearance. We further introduce multi-agent orchestration with observer-based refinement to handle multi-instruction inputs and interactions among objectives, increasing the success rate without retraining. Simulation and real-world experiments demonstrate that our method achieves smoother, safer, and more interpretable trajectory modifications than state-of-the-art baselines.
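The "instruction to verifiable geometric constraint" step can be illustrated with a minimal sketch. The constraint dictionary, the keep-out-sphere geometry, and the single projection step below are illustrative assumptions, not GELATO's actual constraint schema or its vector-field optimizer (which also preserves smoothness and feasibility).

```python
import numpy as np

# Hypothetical constraint an LLM might emit for "stay 0.3 m away from the cup"
constraint = {"type": "min_clearance", "center": np.array([0.5, 0.0, 0.2]), "radius": 0.3}

def apply_clearance(traj, c):
    """Push violating waypoints radially out of the keep-out sphere (one projection step)."""
    out = traj.copy()
    for i, p in enumerate(out):
        d = p - c["center"]
        dist = np.linalg.norm(d)
        if dist < c["radius"]:
            out[i] = c["center"] + d / max(dist, 1e-9) * c["radius"]
    return out

# Straight-line trajectory passing near the object
traj = np.stack([np.linspace(0.0, 1.0, 20)] * 3, axis=1) * [1.0, 0.0, 0.4]
adapted = apply_clearance(traj, constraint)
min_dist = np.linalg.norm(adapted - constraint["center"], axis=1).min()
print(round(min_dist, 3))  # 0.3 -- the clearance constraint is now satisfied
```

Because the constraint is an explicit geometric predicate, a verifier (or the paper's observer agents) can check it after adaptation rather than trusting the LLM's output.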


Grounded Reinforcement Learning: Learning to Win the Game under Human Commands

Neural Information Processing Systems

From the RL perspective, it is extremely challenging to derive a precise reward function for human preferences, since the commands are abstract and the valid behaviors are highly complex and multi-modal.



OVITA: Open-Vocabulary Interpretable Trajectory Adaptations

Maurya, Anurag, Ghosh, Tashmoy, Nguyen, Anh, Prakash, Ravi

arXiv.org Artificial Intelligence

Adapting trajectories to dynamic situations and user preferences is crucial for robot operation in unstructured environments with non-expert users. Natural language enables users to express these adjustments interactively. We introduce OVITA, an interpretable, open-vocabulary, language-driven framework designed for adapting robot trajectories in dynamic and novel situations based on human instructions. OVITA leverages multiple pre-trained Large Language Models (LLMs) to integrate user commands into trajectories generated by motion planners or learned from demonstrations. OVITA employs code, generated by an LLM, as the adaptation policy, enabling users to adjust individual waypoints and thus providing flexible control. Another LLM, acting as a code explainer, removes the need for expert users and enables intuitive interaction. The efficacy and significance of the proposed OVITA framework are demonstrated through extensive simulation and real-world experiments with diverse tasks involving spatiotemporal variations on heterogeneous robotic platforms such as a KUKA IIWA manipulator, a Clearpath Jackal ground robot, and a CrazyFlie drone.

I. INTRODUCTION: Robotic systems have increasingly permeated diverse domains, from industrial automation to service robotics, demanding efficient trajectory generation and adaptation techniques. A fundamental challenge in this context lies in enabling robots to generalize to dynamic and unstructured environments.
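The code-as-policy idea can be sketched as below: the LLM returns a short program that edits waypoints, and the system executes it against the trajectory. The instruction, the returned snippet, and the bare `exec` call are all illustrative assumptions; a deployed system like OVITA would validate and sandbox the generated code.

```python
import numpy as np

# Toy trajectory: five (x, y, z) waypoints along a straight line at z = 0.5
traj = np.array([[x, 0.0, 0.5] for x in np.linspace(0.0, 1.0, 5)])

# Hypothetical code string an LLM might return for the instruction
# "move the second half of the path 20 cm higher":
llm_policy = """
for i in range(len(traj)):
    if i >= len(traj) // 2:
        traj[i, 2] += 0.2
"""

exec(llm_policy, {"traj": traj})  # a real system would run this in a sandbox
print(traj[:, 2])
```

Because the adaptation is an explicit program over waypoints rather than an opaque learned mapping, a second "explainer" LLM can describe to the user exactly what the edit will do before it runs.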


CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

Glossop, Catherine, Chen, William, Bhorkar, Arjun, Shah, Dhruv, Levine, Sergey

arXiv.org Artificial Intelligence

Figure 1: CAST generates counterfactual action and language labels for uncurated robot trajectory datasets using off-the-shelf VLMs. We use this augmented dataset to train CounterfactualVLA, a navigation policy that can follow complex language instructions in the real world.

Abstract -- Generalist robots should be able to understand and follow user instructions, but current vision-language-action (VLA) models struggle to follow fine-grained commands despite providing a powerful architecture for mapping open-vocabulary natural-language instructions to robot actions. One cause is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method that augments existing robot datasets by leveraging vision-language models to create counterfactual labels. Our method improves the language-following capabilities of VLAs by increasing the diversity and granularity of language grounding in robot datasets through generated counterfactual language and actions. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, through visual language navigation experiments in three different indoor and outdoor environments. Our experiments demonstrate that counterfactual relabeling, without any additional data collection, significantly improves instruction following in VLA policies, making them competitive with state-of-the-art methods and increasing success rate by 27% on navigation tasks. Large vision-language models (VLMs) are powerful not only because of their diverse capabilities but also because they can be steered with fine-grained instructions to produce specific outputs. Ideally, powerful generalist robot policies should exhibit the same level of controllability on embodied tasks.
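The augmentation loop amounts to attaching alternative instruction-action pairs to existing observations. In this minimal sketch, the VLM query is stubbed out with fixed return values, and the episode format, instruction strings, and action names are all hypothetical, not CAST's actual data schema or prompts.

```python
# Minimal sketch of counterfactual relabeling, with the VLM stubbed out.
def propose_counterfactuals(observation, original_instruction):
    # Stand-in for an off-the-shelf VLM prompted with the observation; a real
    # system would ask the model for plausible alternative instructions and
    # the action edits that would satisfy them.
    return [
        ("turn left toward the blue door", "veer_left"),
        ("stop in front of the trash can", "stop"),
    ]

dataset = [{"obs": "frame_000.jpg", "instruction": "go forward", "action": "forward"}]

augmented = list(dataset)
for ep in dataset:
    for instr, act in propose_counterfactuals(ep["obs"], ep["instruction"]):
        augmented.append({"obs": ep["obs"], "instruction": instr, "action": act})

print(len(augmented))  # 3: the original episode plus two counterfactual labels
```

The key property is that several distinct instructions now share the same observation, which is exactly the fine-grained task diversity the abstract says existing datasets lack.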



LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning

Shao, Yiyang, Huang, Xiaoyu, Zhang, Bike, Liao, Qiayuan, Gao, Yuman, Chi, Yufeng, Li, Zhongyu, Shao, Sophia, Sreenath, Koushil

arXiv.org Artificial Intelligence

Figure 1: We propose a language-directed humanoid whole-body control framework that translates natural language commands into continuous robot actions through a Conditional Variational Autoencoder (CVAE). The structured latent space provided by the CVAE enables smooth transitions between diverse and agile behaviors, as shown in the sequence where the robot seamlessly transitions from walking to running, concluding with a hand-waving motion prompted by the corresponding text commands. See more experiments at https://youtu.be/9AN0GulqWwc

Abstract -- General-purpose humanoid robots are expected to interact intuitively with humans, enabling seamless integration into daily life. Natural language provides the most accessible medium for this purpose. However, translating language into humanoid whole-body motion remains a significant challenge, primarily due to the gap between linguistic understanding and physical actions. In this work, we present an end-to-end, language-directed policy for real-world humanoid whole-body control. Our approach combines reinforcement learning with policy distillation, allowing a single neural network to interpret language commands and execute the corresponding physical actions directly. To enhance motion diversity and compositionality, we incorporate a Conditional Variational Autoencoder (CVAE) structure. The resulting policy achieves agile and versatile whole-body behaviors conditioned on language inputs, with smooth transitions between various motions, enabling adaptation to linguistic variations and the emergence of novel motions. Please see our website at Lang-WBC.github.io
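The CVAE-conditioned rollout step can be sketched in miniature: sample a latent from a language-conditioned prior and decode it to bounded joint targets. The latent and action dimensions, the random decoder weights, and the identification of the prior mean with the language embedding are toy assumptions standing in for LangWBC's learned networks.

```python
import numpy as np

LATENT_DIM, ACT_DIM = 16, 19   # hypothetical latent size and joint count

rng = np.random.default_rng(0)
W_dec = rng.standard_normal((LATENT_DIM, ACT_DIM)) * 0.1  # stand-in decoder weights

def act(lang_embedding, z_scale=1.0):
    """CVAE-style step: sample z from a language-conditioned prior, decode to joint targets."""
    z = lang_embedding[:LATENT_DIM] + z_scale * rng.standard_normal(LATENT_DIM)
    return np.tanh(z @ W_dec)   # bounded joint-position targets

walk, wave = np.zeros(LATENT_DIM), np.ones(LATENT_DIM)
a1, a2 = act(walk), act(wave)
print(a1.shape, a2.shape)
```

Because nearby latents decode to nearby actions, interpolating between the latent codes of two commands yields a gradual behavior change, which is the mechanism behind the smooth walk-to-run-to-wave transitions in Figure 1.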


Trajectory Adaptation using Large Language Models

Maurya, Anurag, Ghosh, Tashmoy, Prakash, Ravi

arXiv.org Artificial Intelligence

Abstract: Adapting robot trajectories to new situations based on human instructions is essential for more intuitive and scalable human-robot interaction. This work proposes a flexible language-based framework for adapting generic robot trajectories produced by off-the-shelf motion planners such as RRT and A*, or learned from human demonstrations. We use pre-trained LLMs to adapt trajectory waypoints by generating code as a policy for dense robot manipulation, enabling more complex and flexible instructions than current methods. This approach allows us to incorporate a broader range of commands, including numerical inputs. Compared with state-of-the-art feature-based sequence-to-sequence models, which require training [1] [2], our method requires no task-specific training and offers greater interpretability and more effective feedback mechanisms. We validate our approach through simulation experiments on a robotic manipulator, an aerial vehicle, and a ground robot in the PyBullet and Gazebo simulation environments, demonstrating that LLMs can successfully adapt trajectories to complex human instructions.
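The "numerical inputs" the abstract mentions are the interesting case: an instruction like "slow down the last part of the path by a factor of 2" becomes a parameterized edit of the timestamps. The waypoint format and the `slow_down` helper below are illustrative assumptions, not the paper's actual interface; in the proposed framework an LLM would generate equivalent code from the instruction.

```python
import numpy as np

# Toy trajectory: (x, y, t) waypoints sampled at 1 Hz
traj = np.array([[i * 0.5, 0.0, float(i)] for i in range(6)])

def slow_down(traj, factor, start_index):
    """Numeric adaptation: stretch the time intervals from start_index onward by `factor`."""
    out = traj.copy()
    dt = np.diff(out[:, 2], prepend=out[0, 2])  # per-waypoint time deltas
    dt[start_index:] *= factor
    out[:, 2] = np.cumsum(dt) + traj[0, 2]
    return out

adapted = slow_down(traj, factor=2.0, start_index=3)
print(adapted[:, 2])  # [0. 1. 2. 4. 6. 8.]
```

The spatial waypoints are untouched; only the timing changes, so downstream tracking controllers can consume the adapted trajectory unchanged.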